Improve priority-based ordering for recompute #20117

pengwa · 2024-03-28T04:58:56Z

Improve priority-based ordering for recompute

Critical Path Impact in Priority-Based Topological Sort

Setting and comparing critical path impact is essential in scenarios where all nodes in the priority queue
are of low priority. This comparison helps determine which recompute node to select to unblock the backward
critical path. Consider a scenario where:
- A recompute node subgraph, NodeSubgraph-N, exists within transformer layer N (e.g., NodeSubgraph-5 in
layer 5, NodeSubgraph-3 in layer 3).
- Node-A-IN-5 within NodeSubgraph-5 depends on Node-B-IN-0 within NodeSubgraph-0.
- The priority queue contains nodes from NodeSubgraph-0 to NodeSubgraph-5.
In MemoryOptimizer recompute scenarios, we append nodes starting from NodeSubgraph-5 down to NodeSubgraph-0.
Relying solely on node index for comparison could lead to:

1) Sequential output of nodes in NodeSubgraph-5 (sorted by ascending node index within the subgraph).
2) Blocking of Node-A-IN-5's execution until Node-B-IN-0 is executed.
3) Sequential output of nodes from NodeSubgraph-4 to NodeSubgraph-1.
4) Execution of NodeSubgraph-0 nodes, allowing Node-A-IN-5 and subsequent NodeSubgraph-5 nodes to execute.
5) Execution of remaining NodeSubgraph-0 nodes.

This process can significantly delay the execution of Node-A-IN-5, which in turn blocks the other
NodeSubgraph-5 nodes that depend on it. Since NodeSubgraph-5 nodes are on the critical path, their
dependencies must be scheduled promptly so that they execute as early as possible, ahead of other layers.
This need led to the introduction of critical path impact.
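The effect described above can be reproduced with a minimal sketch of a priority-based topological sort. This is not the ONNX Runtime implementation: the three-layer graph, the node names, the impact values, and the helper `priority_topo_sort` are all hypothetical and chosen only for illustration. Nodes are indexed from the highest layer down, `A2` plays the role of Node-A-IN-5, and `B0` plays the role of Node-B-IN-0.

```python
import heapq

# Hypothetical toy graph: name -> (node index, dependencies, critical path impact).
# Indices ascend from subgraph 2 down to subgraph 0, mirroring the append order.
NODES = {
    "S2.0": (0, [], 30),
    "A2":   (1, ["S2.0", "B0"], 30),  # depends on a node in subgraph 0
    "S2.2": (2, ["A2"], 30),
    "S1.0": (3, [], 20),
    "S1.1": (4, ["S1.0"], 20),
    "B0":   (5, [], 30),              # inherits the impact of its consumer A2
    "S0.1": (6, ["B0"], 10),
}


def priority_topo_sort(nodes, key):
    """Kahn's algorithm with a min-heap; a smaller key is emitted first."""
    remaining = {name: len(deps) for name, (_, deps, _) in nodes.items()}
    consumers = {name: [] for name in nodes}
    for name, (_, deps, _) in nodes.items():
        for dep in deps:
            consumers[dep].append(name)

    heap = [(key(name), name) for name, count in remaining.items() if count == 0]
    heapq.heapify(heap)
    order = []
    while heap:
        _, name = heapq.heappop(heap)
        order.append(name)
        for consumer in consumers[name]:
            remaining[consumer] -= 1
            if remaining[consumer] == 0:
                heapq.heappush(heap, (key(consumer), consumer))
    return order


def by_index(name):
    # Node index only.
    return NODES[name][0]


def by_impact(name):
    # Higher impact first; node index breaks ties.
    return (-NODES[name][2], NODES[name][0])


index_order = priority_topo_sort(NODES, by_index)
impact_order = priority_topo_sort(NODES, by_impact)
print(index_order)
print(impact_order)
```

With index-only comparison, `A2` is emitted only after every subgraph-1 node, because `B0` has the second-largest index. With the impact-aware key, `B0` is emitted right after `S2.0`, so the subgraph-2 nodes finish before any other layer starts.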

Defining Critical Path Impact
Critical path impact is a metric representing a node's influence on the critical path. It is determined
during MemoryOptimizer's operation as follows:
1) Sort graphs without recompute optimization to establish a baseline topological order.
2) Apply recompute optimization.
3) Identify recompute boundary nodes (recompute nodes not consumed by other recompute nodes).
4) For each boundary node, calculate the minimum topological order of all output nodes.
The minimum value indicates the earliest need for the recompute node's execution.
We assign std::numeric_limits<int64_t>::max() - min_topological_order as the critical path impact.
5) For other recompute nodes, assign the maximum critical path impact of their output nodes.
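Steps 3 to 5 can be sketched as follows. This is an illustrative reading of the steps above, not the code from this PR: the function `critical_path_impact`, its parameters, and the example graph are hypothetical. The sketch assumes that every recompute node has an entry in `consumers`, that every boundary node has at least one consumer, and that "output nodes" in step 5 means the recompute consumers of a node.

```python
INT64_MAX = 2**63 - 1  # same value as std::numeric_limits<int64_t>::max()


def critical_path_impact(baseline_order, consumers, recompute_nodes):
    """Return {recompute node: impact}.

    baseline_order:  topological order of the graph sorted before recompute (step 1).
    consumers:       node -> list of nodes that consume its outputs.
    recompute_nodes: set of nodes added by the recompute optimization (step 2).
    """
    impact = {}

    def visit(node):
        if node in impact:
            return impact[node]
        recompute_outputs = [c for c in consumers[node] if c in recompute_nodes]
        if not recompute_outputs:
            # Steps 3 and 4: a boundary node is needed as early as its earliest consumer.
            earliest = min(baseline_order[c] for c in consumers[node])
            impact[node] = INT64_MAX - earliest
        else:
            # Step 5: inherit the largest impact among recompute consumers.
            impact[node] = max(visit(c) for c in recompute_outputs)
        return impact[node]

    for node in recompute_nodes:
        visit(node)
    return impact


# Hypothetical example: the backward node of layer 1 runs at position 10 and
# the backward node of layer 0 at position 20 in the baseline order.
baseline_order = {"bw_layer1": 10, "bw_layer0": 20}
consumers = {
    "r1_a": ["r1_b"],
    "r1_b": ["bw_layer1"],
    "r0_a": ["r0_b", "r1_a"],  # a layer-0 recompute node that also feeds layer 1
    "r0_b": ["bw_layer0"],
}
recompute_nodes = {"r1_a", "r1_b", "r0_a", "r0_b"}

impact = critical_path_impact(baseline_order, consumers, recompute_nodes)
```

In this example `r0_a` receives the same impact as the layer-1 nodes because one of its consumers is `r1_a`, so it outranks `r0_b` even though both belong to layer 0. Subtracting from the `int64_t` maximum means an earlier baseline position yields a larger impact.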

DEFECTs

The change to the priority-based topological sort gives up using priority to order recompute nodes. It also requires an initial topological sort without memory optimization applied, and that first sort can strongly affect how the recompute graph is scheduled. Here is one example:

Some nodes that do not contribute to YieldOp are treated as non-forward ops, so they are scheduled after YieldOp. Such a node may be scheduled early, in which case its dependent recompute graph is scheduled early too. This is not always optimal.

Motivation and Context

Mistral models cannot run user recipes when recompute is enabled. The root cause is that the execution order of recompute nodes is incorrect, which makes the memory savings very limited.

pengwa · 2024-04-08T17:11:05Z

Will be split into PRs.

pengwa · 2024-04-08T17:11:18Z

The first one: #20234

pengwa added 11 commits March 27, 2024 00:43

keep original name as much as possible during fusion

8fc0c41

minor

bf572c4

fix boudary detect

27b872f

Add warning if the boudary node is not found if memory optimizer level = 1 (layer wise recompute)

1330a40

minor

aee8365

fix

73be039

recompute run with its critical path impact factor

ee82c2b

refine code structure a bit

19f83be

Merge branch 'main' of https://github.com/microsoft/onnxruntime into pengwa/recompute_in_critical_path

5162b00

make padding removal work with memory recompute

02c225e

refine codes

e5e30ff

pengwa added the training label Mar 28, 2024

Merge branch 'main' of https://github.com/microsoft/onnxruntime into pengwa/recompute_in_critical_path

936b48b

Base automatically changed from pengwa/priority_tuning to main March 29, 2024 09:44

pengwa added 2 commits March 29, 2024 02:54

fix

2ee7caa

Merge branch 'main' of https://github.com/microsoft/onnxruntime into pengwa/recompute_in_critical_path

a55dcd8

pengwa marked this pull request as ready for review March 29, 2024 10:15

pengwa requested review from wschin, frank-dong-ms, zhijxu-MS and guyang3532 March 29, 2024 10:15

pengwa added 7 commits March 29, 2024 07:09

add NonZero

35183bf

cast propogation && gemm transpose fuse

c33cb83

timestamped priprity based ordering

3e03d1e

Revert "timestamped priprity based ordering"

4b25ad6

This reverts commit 3e03d1e.

Tune single-in-single_out-node-chain for delay execution especially for input leaf nodes

cabb44a

remove log

1f197dc

disable logging

150cd6e

pengwa closed this Apr 8, 2024

pengwa deleted the pengwa/recompute_in_critical_path branch April 22, 2024 09:51
